Self-Organizing n-gram Model for Automatic Word Spacing
نویسندگان
چکیده
An automatic word spacing is one of the important tasks in Korean language processing and information retrieval. Since there are a number of confusing cases in word spacing of Korean, there are some mistakes in many texts including news articles. This paper presents a high-accurate method for automatic word spacing based on self-organizing -gram model. This method is basically a variant of -gram model, but achieves high accuracy by automatically adapting context size. In order to find the optimal context size, the proposed method automatically increases the context size when the contextual distribution after increasing it dose not agree with that of the current context. It also decreases the context size when the distribution of reduced context is similar to that of the current context. This approach achieves high accuracy by considering higher dimensional data in case of necessity, and the increased computational cost are compensated by the reduced context size. The experimental results show that the self-organizing structure of -gram model enhances the basic model.
منابع مشابه
Kohonen Self Organizing for Automatic Identification of Cartographic Objects
Automatic identification and localization of cartographic objects in aerial and satellite images have gained increasing attention in recent years in digital photogrammetry and remote sensing. Although the automatic extraction of man made objects in essence is still an unresolved issue, the man made objects can be extracted from aerial photos and satellite images. Recently, the high-resolution s...
متن کاملAutomatic n-gram language model creation from web resources
This paper describes an automatic building of N-gram language models from Web texts for large vocabulary continuous speech recognition. Although a huge amount of well-formed texts are needed to train a model, collecting and organizing such text corpus for every task by hand needs a great labor. We need the language model to update frequently to cover the current topics. To deal with this proble...
متن کاملA Joint Statistical Model for Simultaneous Word Spacing and Spelling Error Correction for Korean
This paper presents noisy-channel based Korean preprocessor system, which corrects word spacing and typographical errors. The proposed algorithm corrects both errors simultaneously. Using Eojeol transition pattern dictionary and statistical data such as Eumjeol n-gram and Jaso transition probabilities, the algorithm minimizes the usage of huge word dictionaries.
متن کاملA Comparison of Support Vector Machines and Self-Organizing Maps for e-Mail Categorization
This paper reports on experiments in multi-class document categorization with support vector machines and self-organizing maps. A data set consisting of personal e-mail messages is used for the experiments. Two distinct document representation formalisms are employed to characterize these messages, namely a standard word-based approach and a character n-gram document representation. Based on th...
متن کاملLandforms identification using neural network-self organizing map and SRTM data
During an 11 days mission in February 2000 the Shuttle Radar Topography Mission (SRTM) collected data over 80% of the Earth's land surface, for all areas between 60 degrees N and 56 degrees S latitude. Since SRTM data became available, many studies utilized them for application in topography and morphometric landscape analysis. Exploiting SRTM data for recognition and extraction of topographic ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006